refacto dataset version query #2683

Contributor

@sophiely sophiely commented Nov 15, 2023

Problem

The SQL query run for dataset versions is quite slow and often results in a timeout.

Closes: 2684

Solution

In this query, the JSONB_AGG operation takes the most time, so I rewrote the query to run this "heavy operation" (JSONB_AGG) after all JOINs and filters.

For a given namespace and dataset name, on a db.t4g.medium (vCPU: 2, RAM: 4 GB) machine:

  • The old query takes: 4 minutes 21 seconds
  • The new query takes: 1.230 seconds
WITH dataset_info AS (
    SELECT d.type, d.name, d.physical_name, d.namespace_name, d.source_name, d.description, dv.lifecycle_state,
        dv.created_at, dv.version, dv.fields, dv.run_uuid AS createdByRunUuid, sv.schema_location,
        t.tags, f.facets, f.lineage_event_time, f.dataset_version_uuid, facet_name
    FROM dataset_versions dv
    LEFT JOIN datasets_view d ON d.uuid = dv.dataset_uuid
    LEFT JOIN stream_versions AS sv ON sv.dataset_version_uuid = dv.uuid
    LEFT JOIN (
        SELECT ARRAY_AGG(t.name) AS tags, m.dataset_uuid
        FROM tags AS t
        INNER JOIN datasets_tag_mapping AS m ON m.tag_uuid = t.uuid
        GROUP BY m.dataset_uuid
    ) t ON t.dataset_uuid = dv.dataset_uuid
    LEFT JOIN (
        SELECT
            dataset_version_uuid,
            name AS facet_name,
            facet AS facets,
            lineage_event_time
        FROM dataset_facets_view
        WHERE (type ILIKE 'dataset' OR type ILIKE 'unknown')
    ) f ON f.dataset_version_uuid = dv.uuid
    WHERE dv.namespace_name = :namespaceName
        AND dv.dataset_name = :datasetName
    ORDER BY dv.created_at DESC
    LIMIT :limit OFFSET :offset
)
SELECT
    type, name, physical_name, namespace_name, source_name, description, lifecycle_state,
    created_at, version, fields, createdByRunUuid, schema_location,
    tags, dataset_version_uuid,
    JSONB_AGG(facets ORDER BY lineage_event_time ASC) AS facets
FROM dataset_info
GROUP BY type, name, physical_name, namespace_name, source_name, description, lifecycle_state,
    created_at, version, fields, createdByRunUuid, schema_location,
    tags, dataset_version_uuid
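
The idea described under Solution (join and filter first, aggregate last) can be tried in isolation with a small sketch. This is not the Marquez code: it uses Python's built-in sqlite3 module instead of PostgreSQL, GROUP_CONCAT stands in for JSONB_AGG, and the `versions` and `facets` tables and their columns are hypothetical. It only illustrates the query shape and makes no claim about performance.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE versions (uuid TEXT PRIMARY KEY, dataset_name TEXT, created_at INTEGER);
        CREATE TABLE facets (version_uuid TEXT, facet TEXT, event_time INTEGER);
        INSERT INTO versions VALUES
            ('v1', 'orders', 1), ('v2', 'orders', 2), ('v3', 'users', 3);
        INSERT INTO facets VALUES
            ('v1', 'schema', 10), ('v1', 'stats', 11),
            ('v2', 'schema', 20),
            ('v3', 'schema', 30), ('v3', 'owner', 31);
        """
    )
    return conn


# Same shape as the PR's query: the CTE does the joins and filters,
# and the aggregate runs only over the rows that survive.
FILTER_THEN_AGGREGATE = """
WITH filtered AS (
    SELECT v.uuid, v.created_at, f.facet
    FROM versions v
    LEFT JOIN facets f ON f.version_uuid = v.uuid
    WHERE v.dataset_name = :dataset_name
)
SELECT uuid, COUNT(facet) AS facet_count, GROUP_CONCAT(facet) AS facets
FROM filtered
GROUP BY uuid, created_at
ORDER BY created_at DESC
"""


def facets_per_version(conn: sqlite3.Connection, dataset_name: str):
    """Return (uuid, facet_count, sorted facet names) per version, newest first."""
    rows = conn.execute(FILTER_THEN_AGGREGATE, {"dataset_name": dataset_name}).fetchall()
    # GROUP_CONCAT does not guarantee an order here, so sort in Python.
    return [
        (uuid, count, sorted(facets.split(",")) if facets else [])
        for uuid, count, facets in rows
    ]


if __name__ == "__main__":
    for row in facets_per_version(build_demo_db(), "orders"):
        print(row)
```

Only the rows for the requested dataset reach the aggregate; rows for other datasets are discarded in the CTE before any grouping happens.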

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Nov 15, 2023

netlify bot commented Nov 15, 2023

Deploy Preview for peppy-sprite-186812 canceled.

🔨 Latest commit: 8230d17
🔍 Latest deploy log: https://app.netlify.com/sites/peppy-sprite-186812/deploys/6555d05c77607900083d22d6


codecov bot commented Nov 16, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (60d7d90) 84.05% compared to head (8230d17) 84.05%.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2683   +/-   ##
=========================================
  Coverage     84.05%   84.05%           
  Complexity     1379     1379           
=========================================
  Files           248      248           
  Lines          6297     6297           
  Branches        286      286           
=========================================
  Hits           5293     5293           
  Misses          851      851           
  Partials        153      153           


@sophiely sophiely force-pushed the feat/refacto-dataset-version-sql-query branch from 659040f to 9503c89 Compare November 16, 2023 08:09
@sophiely sophiely force-pushed the feat/refacto-dataset-version-sql-query branch from 9503c89 to 8230d17 Compare November 16, 2023 08:18
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Great work!

@pawel-big-lebowski pawel-big-lebowski merged commit dddac31 into MarquezProject:main Nov 16, 2023
16 checks passed

boring-cyborg bot commented Nov 16, 2023

Great job! Congrats on your first merged pull request in the Marquez project!

@wslulciuc wslulciuc added this to the 0.43.0 milestone Dec 13, 2023
@sophiely sophiely deleted the feat/refacto-dataset-version-sql-query branch July 25, 2024 07:45
Labels
api API layer changes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[PERF] Dataset Version query Time out
3 participants